1. Previous Important AI Algorithms

2. Recurrent Neural Network (RNN)

An RNN is designed to process sequential data. It maintains a hidden state that is updated at each time step, allowing the model to remember information from previous inputs. The basic equation is:

$$h_t = \text{activation}\left( W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b \right)$$

Where:

  • $h_t$: The hidden state at time $t$.
  • $x_t$: The input at time $t$.
  • $h_{t-1}$: The hidden state at the previous time step.
  • $W_{xh}$, $W_{hh}$: Weight matrices for the input and the previous hidden state.
  • $b$: The bias term.
  • $\text{activation}$: A nonlinear activation function, typically $\tanh$.

RNNs form the foundation for many modern AI applications by enabling the processing of sequences with context. However, they struggle to remember long-term information due to the vanishing gradient problem.
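
As a concrete illustration, here is a minimal NumPy sketch of one recurrent step following the equation above. The toy dimensions, the random weights, and the choice of $\tanh$ as the activation are assumptions made just for this example.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One RNN time step: mix the current input with the previous hidden
    state and apply a nonlinearity (tanh here)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

# Toy sizes: 4-dimensional inputs, 3-dimensional hidden state.
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 4))
W_hh = rng.normal(size=(3, 3))
b = np.zeros(3)

h = np.zeros(3)                      # initial hidden state
for x in rng.normal(size=(5, 4)):    # a sequence of 5 input vectors
    h = rnn_step(x, h, W_xh, W_hh, b)
print(h)                             # final hidden state summarizes the sequence
```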

3. Long Short-Term Memory (LSTM)

Designed to improve on the RNN's memory, the LSTM decides what to remember using three gates (forget, input, and output), maintains a cell state for long-term memory, and updates its state using a candidate cell state and a hidden state. These improvements come at the cost of more computation, so LSTMs are slower than plain RNNs.

$$\begin{bmatrix} f_t \\ i_t \\ o_t \\ \tilde{c}_t \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix} \left( W \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} + b \right)$$

Where:

  • $f_t$, $i_t$, $o_t$: The forget, input, and output gate outputs at time $t$.
  • $\tilde{c}_t$: The candidate cell state at time $t$.
  • $x_t$, $h_{t-1}$: The current input and the previous hidden state.
  • $W$, $b$: The stacked weight matrix and bias covering all four gates.
  • $\sigma$, $\tanh$: The sigmoid and hyperbolic tangent activation functions.

Forget Gate

The forget gate decides which information from the previous cell state should be discarded. It is calculated as:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

Where:

  • $f_t$: The forget gate output at time $t$.
  • $W_f$: Weight matrix for the forget gate.
  • $h_{t-1}$: The hidden state at the previous time step.
  • $x_t$: The input at time $t$.
  • $b_f$: The bias term for the forget gate.
  • $\sigma$: The sigmoid activation function.
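
As a small illustration of the forget gate above (the input and output gates follow exactly the same pattern, only with their own weights), here is a NumPy sketch; the concatenation $[h_{t-1}, x_t]$ and the toy shapes are assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    """f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f): values in (0, 1) that
    scale how much of the previous cell state C_{t-1} is kept."""
    concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    return sigmoid(W_f @ concat + b_f)

# Toy sizes: 3-dimensional hidden state, 4-dimensional input.
rng = np.random.default_rng(0)
f_t = forget_gate(np.zeros(3), rng.normal(size=4),
                  rng.normal(size=(3, 7)), np.zeros(3))
print(f_t)   # one keep/forget factor per cell-state dimension
```
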
Input Gate

The input gate decides which values from the current input and the previous hidden state will update the cell state. It is calculated as:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

Where:

  • $i_t$: The input gate output at time $t$.
  • $W_i$: Weight matrix for the input gate.
  • $h_{t-1}$: The previous hidden state.
  • $x_t$: The current input.
  • $b_i$: The bias term for the input gate.
  • $\sigma$: The sigmoid activation function.
Output Gate

The output gate decides what the next hidden state will be, based on the current cell state. It is calculated as:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

Where:

  • $o_t$: The output gate output at time $t$.
  • $W_o$: Weight matrix for the output gate.
  • $h_{t-1}$: The previous hidden state.
  • $x_t$: The current input.
  • $b_o$: The bias term for the output gate.
  • $\sigma$: The sigmoid activation function.
Candidate Cell State

The candidate cell state proposes new values that could be added to the cell state. It is calculated as:

$$\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$

Where:

  • $\tilde{C}_t$ is the candidate cell state at time step $t$.
  • $W_c$ is the weight matrix for the candidate cell state.
  • $h_{t-1}$ is the previous hidden state.
  • $x_t$ is the current input.
  • $b_c$ is the bias term for the candidate cell state.
Cell State Update

The new cell state keeps part of the previous cell state (scaled by the forget gate) and adds part of the candidate (scaled by the input gate), where $\odot$ denotes elementwise multiplication:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

Where:

  • $C_t$ is the current cell state at time step $t$.
  • $f_t$ is the forget gate's output at time step $t$.
  • $C_{t-1}$ is the previous cell state.
  • $i_t$ is the input gate's output at time step $t$.
  • $\tilde{C}_t$ is the candidate cell state.
Hidden State Update

The new hidden state is the output gate applied to the squashed current cell state:

$$h_t = o_t \odot \tanh(C_t)$$

Where:

  • $h_t$ is the hidden state at time step $t$.
  • $o_t$ is the output gate's output at time step $t$.
  • $C_t$ is the current cell state at time step $t$.
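
Putting the equations above together, here is a minimal NumPy sketch of one full LSTM step; the parameter names, dictionary layout, and toy dimensions are assumptions for the example, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following the equations above."""
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])       # forget gate
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])       # input gate
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])       # output gate
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde           # cell state update
    h_t = o_t * np.tanh(c_t)                     # hidden state update
    return h_t, c_t

# Toy sizes: 4-dimensional input, 3-dimensional hidden/cell state.
rng = np.random.default_rng(0)
p = {f"W_{g}": rng.normal(size=(3, 7)) for g in "fioc"}
p.update({f"b_{g}": np.zeros(3) for g in "fioc"})

h, c = np.zeros(3), np.zeros(3)
for x in rng.normal(size=(5, 4)):    # a sequence of 5 input vectors
    h, c = lstm_step(x, h, c, p)
print(h, c)
```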

4. Gated Recurrent Unit (GRU)

The GRU simplifies the LSTM by using only an update gate and a reset gate, making it more efficient. It merges the cell state and hidden state into one, reducing complexity while keeping strong performance.

$$\begin{aligned} z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\ r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\ \tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h) \\ h_t &= (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t \end{aligned}$$

Where:

  • $z_t$: The update gate output at time step $t$.
  • $r_t$: The reset gate output at time step $t$.
  • $\tilde{h}_t$: The candidate hidden state at time step $t$.
  • $h_t$: The hidden state at time step $t$.
  • $W_z$, $W_r$, $W_h$, $U_z$, $U_r$, $U_h$: Weight matrices for the input and the previous hidden state.
  • $b_z$, $b_r$, $b_h$: Bias terms.
  • $\circ$: Elementwise (Hadamard) multiplication.
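
A minimal NumPy sketch of one GRU step following these equations; the parameter dictionary and toy shapes are assumptions for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU time step: update gate z_t, reset gate r_t, candidate hidden
    state, then an interpolation between the old and candidate states."""
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])
    h_tilde = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r_t * h_prev) + p["b_h"])
    return (1 - z_t) * h_prev + z_t * h_tilde

# Toy sizes: 4-dimensional input, 3-dimensional hidden state.
rng = np.random.default_rng(0)
p = {}
for g in "zrh":
    p[f"W_{g}"] = rng.normal(size=(3, 4))
    p[f"U_{g}"] = rng.normal(size=(3, 3))
    p[f"b_{g}"] = np.zeros(3)

h = np.zeros(3)
for x in rng.normal(size=(5, 4)):
    h = gru_step(x, h, p)
print(h)
```
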
Later improvements processed the sequence from right to left and in both directions (bidirectional RNNs), as sketched below. Finally, a scoring mechanism, attention, was introduced to compare the last element with the rest of the sequence and weigh the importance of each word. This improved the handling of long-range dependencies, addressing the memory problem, and is the basis of transformers.
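
As a rough illustration of the bidirectional idea, the sketch below runs the same simple RNN once left-to-right and once right-to-left and concatenates the two hidden states at each position; all names and dimensions are my own for the example.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

def run(seq, W_xh, W_hh, b):
    """Run a simple RNN over a sequence, returning the hidden state at each step."""
    h, states = np.zeros(W_hh.shape[0]), []
    for x in seq:
        h = rnn_step(x, h, W_xh, W_hh, b)
        states.append(h)
    return np.stack(states)

# Toy data: 5 tokens, 4 features each; 3-dimensional hidden states.
rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 4))
fwd = (rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3))
bwd = (rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3))

forward = run(seq, *fwd)                       # left-to-right pass
backward = run(seq[::-1], *bwd)[::-1]          # right-to-left pass, realigned
bidirectional = np.concatenate([forward, backward], axis=-1)
print(bidirectional.shape)                     # (5, 6): each token sees both directions
```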

5. Transformers

How is this used?

After training, the model holds learned weight matrices. You feed in a sentence, the model computes attention scores over it from those weights, and it predicts the next word step by step.
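
That scoring step is usually implemented as scaled dot-product attention, the core operation inside a transformer. Below is a minimal NumPy sketch; here the queries, keys, and values are all taken from the same toy token matrix, and all shapes are assumptions for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: each row of the score matrix says how
    strongly one position attends to every other position."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len) attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted sum of values

# Toy self-attention: 5 tokens with 8-dimensional representations.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)   # (5, 8)
```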